Student’s t-test (one-sample, paired, two-sample)#

A t-test answers a very specific question:

Is a mean (or mean difference) far enough from a reference value that it would be surprising if the true mean were actually that reference value?

It’s the default tool for mean comparisons when:

  • you have numeric data (measurements, scores, times, amounts)

  • the population standard deviation \(\sigma\) is unknown (almost always)

  • sample sizes are small to medium, so estimating uncertainty matters


Learning goals#

By the end you should be able to:

  • pick the right t-test variant (one-sample vs paired vs two-sample)

  • compute the t statistic and degrees of freedom by hand

  • understand and interpret p-values and confidence intervals

  • implement the test end-to-end using NumPy only (no SciPy)

  • use interactive Plotly visuals to build intuition

import math
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) When do you use a t-test?#

Use a t-test when you want to compare means:

  • One-sample: “Is the mean different from a baseline?” (e.g., average delivery time vs SLA)

  • Paired: “Did the mean change within the same units?” (e.g., before/after intervention on the same users)

  • Two-sample: “Are the group means different?” (e.g., A/B test: variant B vs A)

A t-test is not a general “difference detector”. It’s not ideal when:

  • the data are extremely skewed or dominated by outliers

  • you care about medians or quantiles, not means

  • observations are dependent (e.g., time series without adjustments)

If those issues apply, consider robust or nonparametric alternatives (permutation tests, Wilcoxon/Mann–Whitney, trimmed-mean tests).

2) The three t-tests (same idea, different setup)#

All t-tests have the same structure:

  1. define a null hypothesis \(H_0\) about a mean (or mean difference)

  2. compute a t statistic = (estimate − null value) / (estimated standard error)

  3. use the Student t distribution to measure how “extreme” that statistic is

Quick reference#

| Test | Parameter | Typical \(H_0\) |
| --- | --- | --- |
| One-sample | mean \(\mu\) | \(\mu = \mu_0\) |
| Paired | mean difference \(\mu_d\) | \(\mu_d = 0\) |
| Two-sample (Welch) | mean difference \(\mu_1 - \mu_2\) | \(\mu_1 - \mu_2 = 0\) |

Rule of thumb: for two independent groups, prefer Welch’s t-test (it does not assume equal variances).

3) Hypothesis testing vocabulary (and what it actually means)#

  • \(H_0\) (null): the “no effect / no difference” statement you test.

  • \(H_1\) (alternative): what you’ll believe if the data are inconsistent with \(H_0\).

  • \(\alpha\) (significance level): a chosen false-positive rate (often 0.05).

  • p-value: under \(H_0\), the probability of observing a test statistic at least as extreme as what you got.

Important interpretation points:

  • A p-value is not “the probability \(H_0\) is true”.

  • A small p-value means the data are unlikely under \(H_0\).

  • A large p-value does not prove “no effect”; it may mean “not enough evidence”.

A good habit: always pair the p-value with a confidence interval (CI) and an effect size.

4) Why the t distribution appears#

If the population standard deviation \(\sigma\) were known, the standardized mean

\[ Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \]

would follow a standard normal distribution under \(H_0\) (for normal data).

In reality, \(\sigma\) is unknown, so we estimate it with the sample standard deviation \(s\). Replacing \(\sigma\) with \(s\) injects extra uncertainty, and the distribution becomes heavier-tailed.

That heavier-tailed distribution is Student’s t, controlled by the degrees of freedom (df).

  • small df → heavier tails (more uncertainty)

  • large df → approaches Normal(0, 1)

# Student t density (no SciPy)

def t_pdf(t: np.ndarray, df: float) -> np.ndarray:
    t = np.asarray(t, dtype=float)
    df = float(df)

    # log form for numerical stability
    log_coeff = (
        math.lgamma((df + 1) / 2)
        - 0.5 * math.log(df * math.pi)
        - math.lgamma(df / 2)
    )
    return np.exp(log_coeff) * (1 + (t**2) / df) ** (-(df + 1) / 2)


x = np.linspace(-6, 6, 2000)

fig = go.Figure()
for df in [1, 2, 5, 10, 30, 100]:
    fig.add_trace(go.Scatter(x=x, y=t_pdf(x, df), mode="lines", name=f"t(df={df})"))

normal_pdf = (1 / np.sqrt(2 * np.pi)) * np.exp(-(x**2) / 2)
fig.add_trace(
    go.Scatter(
        x=x,
        y=normal_pdf,
        mode="lines",
        name="Normal(0,1)",
        line=dict(color="black", dash="dash"),
    )
)

fig.update_layout(
    title="Student's t distribution: heavier tails for small df",
    xaxis_title="t",
    yaxis_title="density",
)
fig.show()
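The heavier tails can also be checked by simulation: standardizing sample means with the *estimated* \(s\) produces extreme values noticeably more often than Normal(0, 1) predicts. A small standalone sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many small samples from Normal(0, 1) and standardize each mean
# with the *estimated* standard deviation (this is what the t statistic does).
n, n_sim = 5, 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
t_vals = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Normal(0, 1) has P(|Z| > 2) ≈ 0.046; t with df = 4 has ≈ 0.116.
tail_frac = float(np.mean(np.abs(t_vals) > 2))
print(f"estimated P(|T| > 2) = {tail_frac:.3f} (normal tail ≈ 0.046)")
```

With df = 4 the standardized means land beyond ±2 more than twice as often as a normal distribution would predict.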

5) The one-sample t-test (the core idea)#

Given observations \(x_1, \dots, x_n\):

  • sample mean: \(\bar{x}\)

  • sample standard deviation:

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} \]

The standard error (estimated std of the sample mean) is:

\[ \text{SE}(\bar{x}) = \frac{s}{\sqrt{n}} \]

The t statistic for testing \(H_0: \mu = \mu_0\) is:

\[ t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]

Under \(H_0\) (and normal data),

\[ t \sim t_{\text{df}=n-1} \]

Interpretation: \(t\) is “how many estimated standard errors away from \(\mu_0\) the sample mean is”.
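As a quick worked example (the measurements below are made up), the whole computation fits in a few lines of NumPy:

```python
import numpy as np

# Hypothetical measurements; baseline mu0 = 10.0
x = np.array([10.2, 9.8, 10.5, 10.9, 10.1, 10.4])
mu0 = 10.0

n = x.size
x_bar = x.mean()
s = x.std(ddof=1)        # sample standard deviation (n - 1 in the denominator)
se = s / np.sqrt(n)      # standard error of the mean
t_stat = (x_bar - mu0) / se
df = n - 1

print(f"x̄ = {x_bar:.4f}, s = {s:.4f}, SE = {se:.4f}")
print(f"t = {t_stat:.3f} with df = {df}")
```

Here the sample mean sits about 2 estimated standard errors above the baseline.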

6) Implementation from scratch (NumPy-only)#

We’ll implement:

  • one-sample t-test

  • two-sample t-test (pooled-variance and Welch)

  • paired t-test (as a one-sample test on differences)

To keep everything transparent, we’ll also compute:

  • p-values via numerical integration of the t density

  • critical values / confidence intervals via inverting the CDF with bisection

This is for learning. In production, use a dedicated stats library for speed and edge cases.

def t_cdf_numeric(t: float, df: float) -> float:
    # Numerical CDF via symmetry + trapezoidal integration (no SciPy).
    t = float(t)
    df = float(df)

    if t == 0.0:
        return 0.5

    sign = 1.0 if t > 0 else -1.0
    t_abs = abs(t)

    # Adaptive resolution: more points for larger |t|.
    n = int(np.clip(np.ceil(5000 * t_abs), 4000, 200000))
    grid = np.linspace(0.0, t_abs, n)
    trapezoid = getattr(np, "trapezoid", None) or np.trapz  # np.trapz was renamed in NumPy 2.0
    area = trapezoid(t_pdf(grid, df), grid)

    return float(0.5 + sign * area)
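One way to gain confidence in this numerical approach: at df = 1 the t distribution is the standard Cauchy, whose CDF has the closed form \(1/2 + \arctan(t)/\pi\). A standalone re-check, re-implementing the same symmetry + trapezoid idea so the snippet runs on its own:

```python
import math
import numpy as np

def cauchy_pdf(t):
    # Student t with df = 1 is the standard Cauchy distribution.
    return 1.0 / (math.pi * (1.0 + np.asarray(t, dtype=float) ** 2))

def cauchy_cdf_numeric(t: float) -> float:
    # Same symmetry + trapezoid scheme as t_cdf_numeric, specialized to df = 1.
    sign = 1.0 if t > 0 else -1.0
    grid = np.linspace(0.0, abs(t), 50_000)
    y = cauchy_pdf(grid)
    area = float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(grid)))
    return 0.5 + sign * area

for t in [0.5, 1.0, 2.0]:
    exact = 0.5 + math.atan(t) / math.pi
    print(f"t={t}: numeric={cauchy_cdf_numeric(t):.6f}, closed form={exact:.6f}")
```

The numeric and closed-form values should agree to several decimal places, which is plenty for p-values.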


def t_p_value(t_stat: float, df: float, alternative: str = "two-sided") -> float:
    cdf = t_cdf_numeric(t_stat, df)

    if alternative == "two-sided":
        return float(2 * min(cdf, 1 - cdf))
    if alternative == "greater":
        return float(1 - cdf)
    if alternative == "less":
        return float(cdf)

    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")


def t_ppf_numeric(p: float, df: float, *, tol: float = 1e-6, max_iter: int = 80) -> float:
    # Inverse CDF via bisection (sufficient for CI/critical values).
    p = float(p)
    df = float(df)

    if not (0.0 < p < 1.0):
        raise ValueError("p must be in (0, 1)")
    if p == 0.5:
        return 0.0

    # Symmetry: solve on positive side.
    sign = 1.0 if p > 0.5 else -1.0
    p_target = p if p > 0.5 else 1.0 - p

    lo, hi = 0.0, 50.0
    while t_cdf_numeric(hi, df) < p_target:
        hi *= 2
        if hi > 1e6:
            raise RuntimeError("Failed to bracket quantile")

    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if t_cdf_numeric(mid, df) < p_target:
            lo = mid
        else:
            hi = mid
        if (hi - lo) < tol:
            break

    return sign * 0.5 * (lo + hi)


def ci_two_sided(estimate: float, se: float, df: float, alpha: float = 0.05) -> tuple[float, float]:
    t_crit = t_ppf_numeric(1 - alpha / 2, df)
    return (estimate - t_crit * se, estimate + t_crit * se)


def one_sample_t_test(
    x: np.ndarray,
    *,
    mu0: float = 0.0,
    alternative: str = "two-sided",
    alpha: float = 0.05,
) -> dict:
    x = np.asarray(x, dtype=float)
    if x.ndim != 1 or x.size < 2:
        raise ValueError("x must be a 1D array with at least 2 values")

    n = x.size
    mean = float(x.mean())
    s = float(x.std(ddof=1))
    se = s / np.sqrt(n)

    estimate = mean - mu0
    t_stat = estimate / se
    df = float(n - 1)

    p = t_p_value(t_stat, df, alternative=alternative)
    ci = ci_two_sided(estimate, se, df, alpha=alpha)

    return {
        "test": "one-sample t-test",
        "n": int(n),
        "df": df,
        "mu0": float(mu0),
        "estimate": float(estimate),
        "t_stat": float(t_stat),
        "p_value": float(p),
        "ci": tuple(map(float, ci)),
        "reject_at_alpha": bool(p <= alpha),
    }


def two_sample_t_test(
    x: np.ndarray,
    y: np.ndarray,
    *,
    diff0: float = 0.0,
    equal_var: bool = False,
    alternative: str = "two-sided",
    alpha: float = 0.05,
) -> dict:
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1 or x.size < 2 or y.size < 2:
        raise ValueError("x and y must be 1D arrays with at least 2 values each")

    n1, n2 = x.size, y.size
    m1, m2 = float(x.mean()), float(y.mean())
    s1, s2 = float(x.std(ddof=1)), float(y.std(ddof=1))

    raw_diff = m1 - m2
    estimate = raw_diff - diff0

    if equal_var:
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        se = float(np.sqrt(sp2 * (1 / n1 + 1 / n2)))
        df = float(n1 + n2 - 2)
        variant = "two-sample t-test (pooled variance)"
    else:
        se2 = s1**2 / n1 + s2**2 / n2
        se = float(np.sqrt(se2))
        df = float(se2**2 / ((s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)))
        variant = "two-sample t-test (Welch)"

    t_stat = estimate / se
    p = t_p_value(t_stat, df, alternative=alternative)
    ci = ci_two_sided(estimate, se, df, alpha=alpha)

    return {
        "test": variant,
        "n1": int(n1),
        "n2": int(n2),
        "df": float(df),
        "diff0": float(diff0),
        "estimate": float(estimate),
        "t_stat": float(t_stat),
        "p_value": float(p),
        "ci": tuple(map(float, ci)),
        "reject_at_alpha": bool(p <= alpha),
        "means": (m1, m2),
        "stds": (s1, s2),
    }


def paired_t_test(
    before: np.ndarray,
    after: np.ndarray,
    *,
    diff0: float = 0.0,
    alternative: str = "two-sided",
    alpha: float = 0.05,
) -> dict:
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)

    if before.shape != after.shape or before.ndim != 1 or before.size < 2:
        raise ValueError("before and after must be 1D arrays of the same length (>= 2)")

    d = after - before
    res = one_sample_t_test(d, mu0=diff0, alternative=alternative, alpha=alpha)
    res["test"] = "paired t-test (one-sample on differences)"
    res["mean_before"] = float(before.mean())
    res["mean_after"] = float(after.mean())
    return res


def plot_t_p_value(
    t_stat: float,
    df: float,
    *,
    alternative: str = "two-sided",
    title: str | None = None,
):
    x_max = max(6.0, abs(float(t_stat)) * 1.6 + 1.0)
    x = np.linspace(-x_max, x_max, 3000)
    y = t_pdf(x, df)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name=f"t(df={df:.2f})"))

    fill_color = "rgba(200, 30, 30, 0.25)"

    if alternative == "two-sided":
        cut = abs(float(t_stat))
        masks = [x <= -cut, x >= cut]
        for i, mask in enumerate(masks):
            fig.add_trace(
                go.Scatter(
                    x=np.concatenate([x[mask], x[mask][::-1]]),
                    y=np.concatenate([y[mask], np.zeros_like(y[mask])]),
                    fill="toself",
                    mode="lines",
                    line=dict(color="rgba(0,0,0,0)"),
                    fillcolor=fill_color,
                    name="p-value area" if i == 0 else None,
                    showlegend=(i == 0),
                )
            )
        fig.add_vline(x=cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")
        fig.add_vline(x=-cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")

    elif alternative == "greater":
        cut = float(t_stat)
        mask = x >= cut
        fig.add_trace(
            go.Scatter(
                x=np.concatenate([x[mask], x[mask][::-1]]),
                y=np.concatenate([y[mask], np.zeros_like(y[mask])]),
                fill="toself",
                mode="lines",
                line=dict(color="rgba(0,0,0,0)"),
                fillcolor=fill_color,
                name="p-value area",
            )
        )
        fig.add_vline(x=cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")

    elif alternative == "less":
        cut = float(t_stat)
        mask = x <= cut
        fig.add_trace(
            go.Scatter(
                x=np.concatenate([x[mask], x[mask][::-1]]),
                y=np.concatenate([y[mask], np.zeros_like(y[mask])]),
                fill="toself",
                mode="lines",
                line=dict(color="rgba(0,0,0,0)"),
                fillcolor=fill_color,
                name="p-value area",
            )
        )
        fig.add_vline(x=cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")

    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

    p = t_p_value(t_stat, df, alternative=alternative)
    fig.update_layout(
        title=title or f"Tail area = p-value (t={t_stat:.3f}, df={df:.2f}, p≈{p:.4f})",
        xaxis_title="t",
        yaxis_title="density",
    )
    return fig

7) Example: one-sample t-test#

Scenario: you measure a metric (say, average task completion time) and want to test whether the mean differs from a baseline \(\mu_0\).

We’ll generate a small synthetic sample and test \(H_0: \mu = \mu_0\).

# Synthetic data: mean slightly above baseline
mu0 = 10.0
n = 18
x = rng.normal(loc=10.8, scale=1.5, size=n)

res_1s = one_sample_t_test(x, mu0=mu0, alternative="two-sided", alpha=0.05)
res_1s
{'test': 'one-sample t-test',
 'n': 18,
 'df': 17.0,
 'mu0': 10.0,
 'estimate': 0.6760694827820508,
 't_stat': 2.1448991639983137,
 'p_value': 0.04670272951904608,
 'ci': (0.011058309525378007, 1.3410806560387236),
 'reject_at_alpha': True}
fig = plot_t_p_value(res_1s["t_stat"], res_1s["df"], title="One-sample t-test: p-value as tail area")
fig.show()
mean = float(np.mean(x))
ci_low, ci_high = res_1s["ci"]

fig = px.histogram(
    x=x,
    nbins=12,
    title="Raw observations (with baseline and estimated mean)",
    labels={"x": "value"},
)
fig.add_vline(x=mu0, line_dash="dash", line_color="black", annotation_text="μ₀")
fig.add_vline(x=mean, line_dash="solid", line_color="rgba(200,30,30,0.9)", annotation_text="x̄")

fig2 = go.Figure()
fig2.add_trace(
    go.Scatter(
        x=[res_1s["estimate"]],
        y=["mean − μ₀"],
        mode="markers",
        marker=dict(size=10),
        error_x=dict(
            type="data",
            symmetric=False,
            array=[ci_high - res_1s["estimate"]],
            arrayminus=[res_1s["estimate"] - ci_low],
        ),
    )
)
fig2.add_vline(x=0.0, line_dash="dash", line_color="black")
fig2.update_layout(
    title="Estimated mean difference with 95% CI",
    xaxis_title="difference",
    yaxis_title="",
)

fig.show()
fig2.show()

Interpreting the one-sample result#

  • If p_value <= α, you reject \(H_0\) at level \(\alpha\).

  • The CI tells you a range of plausible mean differences (mean − \(\mu_0\)).

A useful equivalence (two-sided tests):

Rejecting \(H_0: \mu=\mu_0\) at level \(\alpha\) is the same as the \((1-\alpha)\) CI for \(\mu\) not containing \(\mu_0\).

8) Example: two-sample t-test (Welch vs pooled)#

Two independent groups (A and B). We want to test:

\[ H_0: \mu_A - \mu_B = 0 \]

Welch vs pooled-variance (Student) t-test#

  • Welch (recommended default): does not assume equal variances.

  • Pooled variance: assumes the two populations have the same variance.

If you’re unsure, use Welch.
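Concretely, the two variants compute the standard error and degrees of freedom as follows (these are exactly the formulas used in `two_sample_t_test` above). Welch:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad \text{df} = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} \]

Pooled variance:

\[ s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}, \qquad t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}, \qquad \text{df} = n_1 + n_2 - 2 \]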

# Synthetic A/B data with different variances (Welch is appropriate)
n_a, n_b = 24, 20
A = rng.normal(loc=0.0, scale=1.0, size=n_a)
B = rng.normal(loc=0.55, scale=1.5, size=n_b)

res_welch = two_sample_t_test(A, B, equal_var=False)
res_pooled = two_sample_t_test(A, B, equal_var=True)

res_welch, res_pooled
({'test': 'two-sample t-test (Welch)',
  'n1': 24,
  'n2': 20,
  'df': 30.897981559364187,
  'diff0': 0.0,
  'estimate': -0.3018367277038695,
  't_stat': -0.97827544805771,
  'p_value': 0.33553605768968287,
  'ci': (-0.9311915269948647, 0.32751807158712576),
  'reject_at_alpha': False,
  'means': (0.1793406145972717, 0.48117734230114123),
  'stds': (0.7546575962141199, 1.1955516622986013)},
 {'test': 'two-sample t-test (pooled variance)',
  'n1': 24,
  'n2': 20,
  'df': 42.0,
  'diff0': 0.0,
  'estimate': -0.3018367277038695,
  't_stat': -1.0182971425705634,
  'p_value': 0.3143669449743095,
  'ci': (-0.9000228288264907, 0.29634937341875167),
  'reject_at_alpha': False,
  'means': (0.1793406145972717, 0.48117734230114123),
  'stds': (0.7546575962141199, 1.1955516622986013)})
values = np.concatenate([A, B])
groups = np.array(["A"] * A.size + ["B"] * B.size)

fig = px.violin(
    x=groups,
    y=values,
    box=True,
    points="all",
    title="Two groups: distribution view (A vs B)",
    labels={"x": "group", "y": "value"},
)
fig.show()

ci_low, ci_high = res_welch["ci"]
fig2 = go.Figure()
fig2.add_trace(
    go.Scatter(
        x=[res_welch["estimate"]],
        y=["A − B"],
        mode="markers",
        marker=dict(size=10),
        error_x=dict(
            type="data",
            symmetric=False,
            array=[ci_high - res_welch["estimate"]],
            arrayminus=[res_welch["estimate"] - ci_low],
        ),
    )
)
fig2.add_vline(x=0.0, line_dash="dash", line_color="black")
fig2.update_layout(
    title="Estimated mean difference (Welch) with 95% CI",
    xaxis_title="difference",
    yaxis_title="",
)
fig2.show()
fig = plot_t_p_value(res_welch["t_stat"], res_welch["df"], title="Welch t-test: p-value as tail area")
fig.show()

9) Example: paired t-test#

Paired data means each “before” is linked to an “after” for the same unit (person, device, country, …).

The trick: compute the difference per pair and run a one-sample t-test on those differences.

\[ d_i = \text{after}_i - \text{before}_i \]

Then test \(H_0: \mu_d = 0\).

# Synthetic paired data: the same units measured twice
n = 18
before = rng.normal(loc=50, scale=8, size=n)
# after is correlated with before + an average improvement
after = before + rng.normal(loc=2.0, scale=3.0, size=n)

res_paired = paired_t_test(before, after)
res_paired
{'test': 'paired t-test (one-sample on differences)',
 'n': 18,
 'df': 17.0,
 'mu0': 0.0,
 'estimate': 0.903508293293421,
 't_stat': 1.7405904074002199,
 'p_value': 0.0998231022246423,
 'ci': (-0.19165794524122026, 1.9986745318280623),
 'reject_at_alpha': False,
 'mean_before': 50.07036480566932,
 'mean_after': 50.97387309896274}
fig = go.Figure()
for i in range(n):
    fig.add_trace(
        go.Scatter(
            x=["before", "after"],
            y=[before[i], after[i]],
            mode="lines+markers",
            line=dict(color="rgba(0,0,0,0.25)"),
            marker=dict(size=6),
            showlegend=False,
        )
    )

fig.update_layout(
    title="Paired measurements: each line is one unit",
    xaxis_title="",
    yaxis_title="value",
)
fig.show()

diffs = after - before
fig2 = px.histogram(
    x=diffs,
    nbins=12,
    title="Paired differences (after − before)",
    labels={"x": "difference"},
)
fig2.add_vline(x=0.0, line_dash="dash", line_color="black")
fig2.show()
fig = plot_t_p_value(res_paired["t_stat"], res_paired["df"], title="Paired t-test: p-value as tail area")
fig.show()

10) Interpreting the output (t, df, p, CI)#

t statistic#

\[ t = \frac{\text{estimate} - \text{null value}}{\text{estimated standard error}} \]

  • large \(|t|\) → estimate is many standard errors away from the null

degrees of freedom (df)#

df controls tail heaviness. Smaller df means heavier tails (more uncertainty), because the standard deviation was estimated from less data.

p-value#

  • two-sided: “How likely is a value with \(|T| \ge |t_{obs}|\) under \(H_0\)?”

  • one-sided: tail probability only on the direction you care about

confidence interval#

For two-sided tests, the \((1-\alpha)\) CI and the test are consistent:

  • if the CI excludes 0 (or \(\mu_0\)), you reject at level \(\alpha\)

  • if the CI includes 0 (or \(\mu_0\)), you don’t reject

practical vs statistical significance#

A tiny effect can be statistically significant with a large sample; a meaningful effect can be non-significant with a small sample. Always look at the effect size + CI.
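One widely used effect size for mean comparisons is Cohen's d, the mean difference expressed in standard-deviation units. A minimal one-sample sketch (with made-up numbers):

```python
import numpy as np

def cohens_d_one_sample(x: np.ndarray, mu0: float) -> float:
    # Standardized mean difference: (x̄ − μ0) / s.
    x = np.asarray(x, dtype=float)
    return float((x.mean() - mu0) / x.std(ddof=1))

# Hypothetical sample vs baseline 10.0
x = np.array([10.2, 9.8, 10.5, 10.9, 10.1, 10.4])
d = cohens_d_one_sample(x, 10.0)
print(f"Cohen's d ≈ {d:.3f}")  # conventional rough bands: 0.2 small, 0.5 medium, 0.8 large
```

Unlike the t statistic, d does not grow with \(n\), so it separates "how big is the effect" from "how sure are we".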

alpha = 0.05

def ci_excludes_zero(ci: tuple[float, float]) -> bool:
    lo, hi = ci
    return (lo > 0.0) or (hi < 0.0)

checks = {
    "one_sample": {
        "p<=alpha": res_1s["p_value"] <= alpha,
        "ci_excludes_0": ci_excludes_zero(res_1s["ci"]),
    },
    "welch": {
        "p<=alpha": res_welch["p_value"] <= alpha,
        "ci_excludes_0": ci_excludes_zero(res_welch["ci"]),
    },
    "paired": {
        "p<=alpha": res_paired["p_value"] <= alpha,
        "ci_excludes_0": ci_excludes_zero(res_paired["ci"]),
    },
}
checks
{'one_sample': {'p<=alpha': True, 'ci_excludes_0': True},
 'welch': {'p<=alpha': False, 'ci_excludes_0': False},
 'paired': {'p<=alpha': False, 'ci_excludes_0': False}}
# Under a true null, p-values should be ~Uniform(0,1).
# We'll simulate many one-sample t-tests where H0 is true.

n = 12
n_sim = 400
mu0 = 0.0

pvals = []
for _ in range(n_sim):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    mean = float(sample.mean())
    s = float(sample.std(ddof=1))
    se = s / np.sqrt(n)
    t_stat = (mean - mu0) / se
    pvals.append(t_p_value(t_stat, n - 1, alternative="two-sided"))

fig = px.histogram(
    x=pvals,
    nbins=20,
    title="p-values under a true null are (approximately) uniform",
    labels={"x": "p-value"},
)
fig.add_hline(y=n_sim / 20, line_dash="dash", line_color="black", annotation_text="uniform baseline")
fig.show()

11) Assumptions, diagnostics, and pitfalls#

Assumptions (typical)#

  • Independence of observations (or of paired differences)

  • The sampling distribution of the mean is close enough to normal:

    • data are approximately normal, or

    • sample size is large enough for the CLT to kick in

Common pitfalls#

  • choosing one-sided vs two-sided after seeing the data

  • stopping data collection when p < 0.05 (optional stopping) without correction

  • multiple testing without adjustment (p-hacking)

  • ignoring effect size and uncertainty (CI)

Alternatives when assumptions fail#

  • Permutation test on the mean difference (very interpretable)

  • Bootstrap CI for the mean / mean difference

  • Wilcoxon signed-rank (paired) or Mann–Whitney U (two-sample) for ordinal / heavy-tailed cases
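Of these, the bootstrap CI is the quickest to sketch: resample the data with replacement many times and take percentiles of the resampled means. A minimal standalone version (the skewed data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative skewed data, where a normal-theory CI may be questionable.
x = rng.exponential(scale=2.0, size=40)

n_boot = 10_000
idx = rng.integers(0, x.size, size=(n_boot, x.size))  # resample indices with replacement
boot_means = x[idx].mean(axis=1)

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {x.mean():.3f}, 95% bootstrap CI = ({ci_low:.3f}, {ci_high:.3f})")
```

This percentile bootstrap makes no normality assumption; it only assumes the observations are independent draws from the same distribution.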

# Insight: as n grows, the same effect becomes easier to detect.

alpha = 0.05
true_effect = 0.3  # mean shift in SD units
n_grid = np.array([5, 8, 12, 20, 30, 50, 80])

n_sim = 3000

powers = []
false_pos = []

for n in n_grid:
    df = n - 1
    t_crit = t_ppf_numeric(1 - alpha / 2, df)

    # Under H0 (no effect)
    x0 = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
    m0 = x0.mean(axis=1)
    s0 = x0.std(axis=1, ddof=1)
    t0 = m0 / (s0 / np.sqrt(n))
    false_pos.append(float(np.mean(np.abs(t0) >= t_crit)))

    # Under H1 (effect)
    x1 = rng.normal(loc=true_effect, scale=1.0, size=(n_sim, n))
    m1 = x1.mean(axis=1)
    s1 = x1.std(axis=1, ddof=1)
    t1 = m1 / (s1 / np.sqrt(n))
    powers.append(float(np.mean(np.abs(t1) >= t_crit)))

fig = go.Figure()
fig.add_trace(go.Scatter(x=n_grid, y=false_pos, mode="lines+markers", name="Type I error (H0)"))
fig.add_trace(go.Scatter(x=n_grid, y=powers, mode="lines+markers", name=f"Power (effect={true_effect})"))

fig.add_hline(y=alpha, line_dash="dash", line_color="black", annotation_text="α")

fig.update_layout(
    title="Sample size vs false-positive rate and power (two-sided t-test)",
    xaxis_title="sample size n",
    yaxis_title="probability",
    yaxis=dict(range=[0, 1]),
)
fig.show()

12) Exercises#

  1. Implement a permutation test for the two-sample mean difference and compare its p-value to Welch’s t-test.

  2. Create data with one huge outlier and see how the t-test changes. Try a trimmed mean instead.

  3. For the paired case, simulate scenarios where pairing helps a lot vs not at all (correlation between before/after).

References#

  • Student (1908), The probable error of a mean (original t-test paper)

  • Any intro stats text: sections on Student’s t distribution and t-tests